Checkout the presentation at: https://youtu.be/cCZ3wZs-RkA Attrition is the voluntary loss of employees, and here at DDSAnalytics, we’re looking to determine the factors that lead to this. We determined that three primary facors are over time, montly income, and job roles. Due to this, predictors for monthly income were also analyzed. It was determined that traveling, total years worked, and job level were significant indicators. Fianlly, it was noticed that sales reps were of the highest percentage when it came to attrition, and mangers and research directors were amoung the most experienced and bringing in the highest monthly income.
DDSAnalytics is a Fortune 100 company that specializes in talent development. A recent effort for their data science team is to determine predicting factors for turn over rates. The interest lies in the top three factors, and whether there are any trends to specific job roles. The following document shows the effort to determined these high interest variables. ## Data Description *** The dataset recieved contained roughly 33 different factors and over 850 observations. The factors ranged from job related attributes such as departments to personal affairs such as marriage. #### Read in the data and load in all the libraries
# load in the libraries
library(stringi)
library(plotly)
library(plyr)
library(class)
library(caret)
library(e1071)
library(MASS)
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)
library(tidyverse)
library(shinydashboard)
library(shiny)
library(glue)
library(reshape)
library(dplyr)
library(FNN)
library(gmodels)
library(psych)
require(ggiraph)
require(ggiraphExtra)
require(plyr)
# read in the dataframe
data = read.csv("../data_sets/CaseStudy2-data.csv",header = TRUE)
After checking the data for duplicates and null values, it was determined that the data was clean and no additional cleaning would be required.
# check for unique employees only
distinct_df<-distinct(data, EmployeeNumber)
# check for complete observations
#df[complete.cases(data), ]
Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, OverTime, JobLevel, JobSatisfaction, JobInvolvement, PerformanceRating, RelationshipSatisfaction,StandardHours, StockOptionLevel, NumCompaniesWorked, TrainingTimesLastYear, WorkLifeBalance, Education, YearsInCurrentRole, EnvironmentSatisfaction
ID, Age, Attrition, DailyRate, DistanceFromHome, EmployeeNumber, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, YearsWithCurrManager
# create a dataframe with categorical, removing Over 18 and
categorical <- subset(data, select = c(Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, OverTime, JobLevel, JobSatisfaction, JobInvolvement, PerformanceRating, RelationshipSatisfaction, StockOptionLevel, NumCompaniesWorked, TrainingTimesLastYear, WorkLifeBalance, Education, YearsInCurrentRole, EnvironmentSatisfaction))
# create a dataframe with all continuous
continuous <- subset(data, select = c(ID, Age, Attrition, DailyRate, DistanceFromHome, EmployeeNumber, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, YearsWithCurrManager, StandardHours) )
Reviewing the plots below, the numerical factors that appear to correlate with higher attrition rates is:
The were selected due to the clustering or correlation R2 of positive attrition.
# differentiate colors by attrition
my_cols <- c("#00AFBB", "#E7B800")
df_attrition_no <- continuous %>% filter(Attrition == "No")
df_attrition_yes <- continuous %>% filter(Attrition == "Yes")
# Correlation panel
panel.cor <- function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
# Customize upper panel
upper.panel<-function(x, y){
points(x,y, pch = 1, col = my_cols[continuous$Attrition])
}
# Customize upper panel
upper.panel<-function(x, y){
points(x,y, pch = 1, col = my_cols[df_attrition_yes$Attrition])
}
pairs(df_attrition_yes[,1:5], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. business
pairs(df_attrition_yes[,5:10], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. business
pairs(df_attrition_yes[,10:15], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. business
It’s apparent that only specific variables are significant in terms of attrition. They variables are listed below: * Job Role Marital Status Overtime Job Level * Job Involvement Stock Option Level These variables has the most stark differences when it comes to the percentage differences between the variables.
It is interesting to note that gender, education, and job satisfaction didn’t have a major impact on attrition.
#compare the plots with attrition being the explainatory factor (Categorical)
data$AttritionF <- as.factor(data$Attrition)
categorical_variables = colnames(categorical)
graphing <- function(data, categorical_variables) {
for (val in categorical_variables) {
plot <- ggplot(data=data)+aes_string(val)+geom_bar(aes(fill=as.factor(AttritionF)), position="fill")+ ggtitle(glue("{val}"))
print(plot)
}
}
graphing(data, categorical_variables)
##### Test variables for relevance Pull the variables visually analyzed and test for relevance in the final model. Since we are testing for a categorical variable, KNN, is utlized.
Scaling the data to prevent loss of patterns that would not been seen if not completed. Extract the data that’s determined the most relevant from the analysis above.
# Scale the data
# scale the data
data$MonthlyIncomeScaled <- scale(data$MonthlyIncome)
data$PercentSalaryHikeScaled <- scale(data$PercentSalaryHike)
data$TotalWorkingYearsScaled <- scale(data$TotalWorkingYears)
data$YearsAtCompanyScaled <- scale(data$YearsAtCompany)
data$YearsSinceLastPromotionScaled <- scale(data$YearsSinceLastPromotion)
# grab all significant variables for testing
variables_graphed <- subset(data, select = c(ID, Attrition, JobRole, MaritalStatus, OverTime, JobLevel, JobInvolvement, StockOptionLevel, WorkLifeBalance, MonthlyIncome, PercentSalaryHike, TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, MonthlyIncomeScaled, PercentSalaryHikeScaled, TotalWorkingYearsScaled, YearsAtCompanyScaled, YearsSinceLastPromotionScaled))
need_dummycode <- c("JobRole", "MaritalStatus", "OverTime")
for (val in need_dummycode) {
temp <- as.data.frame(dummy.code(data[[val]]))
variables_graphed <- cbind(variables_graphed, temp)
}
Its determined that K-15 results is the best KNN-model. * Monthly Salary * Overtime * Role
at K = 7, seed = 7
# cross validation knn
set.seed(7)
splitPerc = .80
#split data
trainIndices = sample(1:dim(variables_graphed)[1],round(splitPerc * dim(variables_graphed)[1]))
train = variables_graphed[trainIndices,]
test = variables_graphed[-trainIndices,]
# create dataframe to hold KNN results
accs = data.frame(accuracy = numeric(20), k = numeric(20))
# run KNN algorithm
for(i in 1:20)
{
classifications = knn(train[,c(32,33,15,20,21,22,23,24,25,26,27,28)],test[,c(32,33,15,20,21,22,23,24,25,26,27,28)],train$Attrition, prob = TRUE, k = i)
#print(table(test$Attrition,classifications))
tryCatch({ CM <- confusionMatrix(table(test$Attrition,classifications))
accs$accuracy[i] = CM$overall[1]
accs$specificity[i] = CM$byClass[2]
accs$sensitivity[i] = CM$byClass[1]
accs$k[i] = i},
error = function(e) {
accs$accuracy[i] = 0
accs$specificity[i] = 0
accs$sensitivity[i] = 0
accs$k[i] = 0
})
#CM <- confusionMatrix(table(test$Attrition,classifications))
}
accs
## accuracy k specificity sensitivity
## 1 0.7528736 1 0.2222222 0.8913043
## 2 0.8620690 2 0.4615385 0.8944099
## 3 0.8448276 3 0.3888889 0.8974359
## 4 0.8678161 4 0.5000000 0.8902439
## 5 0.8563218 5 0.4375000 0.8987342
## 6 0.8735632 6 0.5384615 0.9006211
## 7 0.8563218 7 0.4375000 0.8987342
## 8 0.8850575 8 0.6363636 0.9018405
## 9 0.8735632 9 0.5384615 0.9006211
## 10 0.8850575 10 0.6363636 0.9018405
## 11 0.8793103 11 0.5833333 0.9012346
## 12 0.9022989 12 0.8750000 0.9036145
## 13 0.9022989 13 0.8750000 0.9036145
## 14 0.9022989 14 0.8750000 0.9036145
## 15 0.9022989 15 0.8750000 0.9036145
## 16 0.9022989 16 0.8750000 0.9036145
## 17 0.9022989 17 0.8750000 0.9036145
## 18 0.9022989 18 0.8750000 0.9036145
## 19 0.9022989 19 0.8750000 0.9036145
## 20 0.8908046 20 0.8333333 0.8928571
plot(accs$k,accs$accuracy, type = "l", xlab = "k", main="K Iterations and Accuracy")
accs1 = data.frame(accuracy = numeric(20), k = numeric(20))
# run KNN algorithm
for(i in 1:20)
{
# set seed
set.seed(i)
splitPerc = .80
#split data
trainIndices = sample(1:dim(variables_graphed)[1],round(splitPerc * dim(variables_graphed)[1]))
train = variables_graphed[trainIndices,]
test = variables_graphed[-trainIndices,]
# run KNN algorithm
classifications = knn(train[,c(32,33,15,20,21,22,23,24,25,26,27,28)],test[,c(32,33,15,20,21,22,23,24,25,26,27,28)],train$Attrition, prob = TRUE, k = 7)
#print(table(test$Attrition,classifications))
tryCatch({ CM <- confusionMatrix(table(test$Attrition,classifications))
accs1$accuracy[i] = CM$overall[1]
accs1$specificity[i] = CM$byClass[2]
accs1$sensitivity[i] = CM$byClass[1]
accs1$seed[i] = i},
error = function(e) {
accs1$accuracy[i] = 0
accs1$specificity[i] = 0
accs1$sensitivity[i] = 0
accs1$seed[i] = 0
})
#CM <- confusionMatrix(table(test$Attrition,classifications))
}
accs1
## accuracy k specificity sensitivity seed
## 1 0.8448276 0 0.6428571 0.8625000 1
## 2 0.8563218 0 0.5714286 0.8682635 2
## 3 0.8045977 0 0.5000000 0.8271605 3
## 4 0.8735632 0 0.2857143 0.8982036 4
## 5 0.8620690 0 0.8000000 0.8658537 5
## 6 0.8793103 0 0.7058824 0.8980892 6
## 7 0.8563218 0 0.4375000 0.8987342 7
## 8 0.8793103 0 0.8333333 0.8827160 8
## 9 0.8448276 0 0.5714286 0.8687500 9
## 10 0.8505747 0 0.5000000 0.8860759 10
## 11 0.7988506 0 0.4375000 0.8354430 11
## 12 0.8965517 0 0.6666667 0.9182390 12
## 13 0.8678161 0 0.6000000 0.8841463 13
## 14 0.8850575 0 0.8181818 0.8895706 14
## 15 0.8103448 0 0.5294118 0.8407643 15
## 16 0.8505747 0 0.6666667 0.8606061 16
## 17 0.8563218 0 0.6923077 0.8695652 17
## 18 0.8505747 0 0.5294118 0.8853503 18
## 19 0.8908046 0 0.7142857 0.8982036 19
## 20 0.8735632 0 0.6363636 0.8895706 20
summary(accs1)
## accuracy k specificity sensitivity seed
## Min. :0.7989 Min. :0 Min. :0.2857 Min. :0.8272 Min. : 1.00
## 1st Qu.:0.8491 1st Qu.:0 1st Qu.:0.5221 1st Qu.:0.8650 1st Qu.: 5.75
## Median :0.8563 Median :0 Median :0.6182 Median :0.8834 Median :10.50
## Mean :0.8566 Mean :0 Mean :0.6069 Mean :0.8764 Mean :10.50
## 3rd Qu.:0.8750 3rd Qu.:0 3rd Qu.:0.6957 3rd Qu.:0.8917 3rd Qu.:15.25
## Max. :0.8966 Max. :0 Max. :0.8333 Max. :0.9182 Max. :20.00
ggplot(data = data, aes(x=JobRole, y=MonthlyIncome, color=Attrition)) + geom_point(position="jitter") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(data = data, aes(x=OverTime, y=MonthlyIncome, z=JobRole, color=Attrition)) + geom_point(position="jitter")
Attrition is defined as reducing/preventing voluntary employee turnover. Here, if the value is no, this means that the employee was not lost, and yes means that the company did lose the employee.
While many factors were considered to lead to attrition, it was determined that the top 3 factors are 1. Monthly Income 2. Over Time 3. Job Role
The final model takes into account all three factors, with a mean specificity of 0.61 and sensitivity of 0.87 at K-7.
# select desired columns from the data sets
training_set <- train %>% select(c("OverTime", "JobRole", "MonthlyIncome"))
testing_set <- test %>% select(c("OverTime", "JobRole", "MonthlyIncome"))
#since the data needs to be dummy coded for categorical, the following is more accurate
training <- train[,c(32,33,15,20,21,22,23,24,25,26,27,28)]
testing<- test[,c(32,33,15,20,21,22,23,24,25,26,27,28)]
# run the data
CM <- knn(training,testing,train$Attrition, prob = TRUE, k = 7)
# save the column to the desired dataset
testing$prediction<- CM
At minimum, the attrition should be at 60% and the sensitivity should be 60%. Aftering running the model to recieve predicted attrition values, the dataset was ordered by id and exported to a csv.
# load in the attrition data
attrition_competition = read.csv("../data_sets/CaseStudy2CompSet No Attrition.csv", header=TRUE)
# order by id
attach(attrition_competition)
attrition_competition_ordered <- attrition_competition[order(ID),]
# run predictions on the competition model
# 1. dummy code data
need_dummycodes <- c("JobRole", "OverTime")
for (val in need_dummycodes) {
temp <- as.data.frame(dummy.code(attrition_competition_ordered[[val]]))
attrition_competition_ordered <- cbind(attrition_competition_ordered, temp)
}
# 2. Scale the monthly income
attrition_competition_ordered$MonthlyIncomeScaled <- scale(attrition_competition_ordered$MonthlyIncome)
# 3. pull the train and validation sets
training <- train[,c(32,33,15,20,21,22,23,24,25,26,27,28)]
testing <- attrition_competition_ordered %>% select(c("Research Scientist","Sales Executive","Laboratory Technician","Manufacturing Director","Manager","Healthcare Representative","Research Director","Sales Representative","Human Resources","No","Yes","MonthlyIncomeScaled"))
# 4. Run the model to get predictions
CM <- knn(training,testing,train$Attrition, prob = TRUE, k = 7)
# 5. save the column to the desired dataset
attrition_competition_ordered$Attrition<- CM
# 6. filter for desired columns
output_attrition <- attrition_competition_ordered %>% select(c("ID", "Attrition"))
# export the attrition predicted data
write.csv(output_attrition, "../data_sets/output_data_sets/Case2Predictions_JamieVo Attrition.csv")
# 1. dummy code all the data
dummy_code2 <- c("BusinessTravel","Department","EducationField","Gender","JobRole","MaritalStatus","OverTime")
for (val in dummy_code2) {
temp <- as.data.frame(dummy.code(data[[val]]))
data <- cbind(data, temp)
}
data$attrition_no <- ifelse(data$Attrition =="No", 1, 0) #dummy code attrition
names(data)<-make.names(names(data),unique = TRUE) # remove spaces from the name
# 2. initial reduction of factors
scatter.smooth(x=data$TotalWorkingYears, y=data$MonthlyIncome, main="")
lm_step1 <- step(lm(MonthlyIncome~Age
+ DailyRate + DistanceFromHome + Education + EmployeeNumber
+ EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobLevel
+ JobSatisfaction + MonthlyRate + NumCompaniesWorked
+ PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StandardHours
+ StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance
+ YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager
+ Travel_Rarely + Travel_Frequently
+ Non.Travel + Research...Development + Sales + Human.Resources
+ Life.Sciences + Medical + Marketing + Technical.Degree
+ Other + Human.Resources + Male + Female
+ Sales.Executive + Research.Scientist + Laboratory.Technician + Manufacturing.Director
+ Healthcare.Representative + Sales.Representative + Manager + Research.Director
+ Human.Resources + Married + Single + Divorced
+ No + Yes + attrition_no, data = data),direction="both")
An interesting note is that Males are in the model, while females are not. We find that the sex of the candidate affects the model.
# 3. customizing model and checking all assumptions. Here, the switch
# split data into train adn test set
set.seed(8)
splitPerc = .80
#split data
trainIndices = sample(1:dim(data)[1],round(splitPerc * dim(data)[1]))
train = data[trainIndices,]
test = data[-trainIndices,]
# Add remove any rates considering they likely feed into the monthly income, this is not helpful
lm_back <- step(lm( MonthlyIncome ~ DistanceFromHome +
JobLevel + PercentSalaryHike + PerformanceRating +
TotalWorkingYears + YearsSinceLastPromotion + YearsWithCurrManager +
Travel_Rarely + Travel_Frequently + Male + Sales.Executive +
Laboratory.Technician + Manufacturing.Director + Healthcare.Representative +
Manager + Research.Director, data = train), direction = "backward")
lm_forward <- step(lm(MonthlyIncome ~ JobLevel + TotalWorkingYears + YearsSinceLastPromotion +
YearsWithCurrManager + Travel_Rarely + Male + Sales.Executive +
Laboratory.Technician + Manufacturing.Director + Healthcare.Representative +
Manager + Research.Director, data = train), direction = "forward")
# run the linear regression
linearMod <- lm(MonthlyIncome ~ JobLevel + TotalWorkingYears + Travel_Rarely, data=train) # build linear regression model on full data
print(linearMod)
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears + Travel_Rarely,
## data = train)
##
## Coefficients:
## (Intercept) JobLevel TotalWorkingYears Travel_Rarely
## -2029.79 3704.57 56.68 303.75
summary(linearMod)
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears + Travel_Rarely,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5310.4 -846.6 24.4 773.0 3805.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2029.79 133.56 -15.198 < 2e-16 ***
## JobLevel 3704.57 74.99 49.400 < 2e-16 ***
## TotalWorkingYears 56.68 10.90 5.202 2.59e-07 ***
## Travel_Rarely 303.75 113.10 2.686 0.00741 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1348 on 692 degrees of freedom
## Multiple R-squared: 0.9149, Adjusted R-squared: 0.9145
## F-statistic: 2478 on 3 and 692 DF, p-value: < 2.2e-16
Overall, it can be argued that no major deviations from the assumptions and the linear regression model has a good fit.
# 1, Normality
sresid <- studres(linearMod)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
# 2. Constant Variance
plot(sresid) + title("Rediduals for Constant Variance")
## integer(0)
# 3. Linearity
ggPredict(linearMod,se=TRUE,interactive=TRUE)
## Warning: package 'gdtools' was built under R version 3.6.2
# 4. Outliers
plot(linearMod)
# prediction of the model
require(graphics)
pred <- predict(linearMod, test)
actuals_preds <- data.frame(cbind(actuals=test$MonthlyIncome, predicteds=pred)) # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds) # 94.38%
head(actuals_preds)
## actuals predicteds
## 2 19626 17987.171
## 10 5063 5832.822
## 14 2476 2035.214
## 15 3423 2545.369
## 22 8120 6363.307
## 24 5332 5946.190
summary(linearMod)
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears + Travel_Rarely,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5310.4 -846.6 24.4 773.0 3805.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2029.79 133.56 -15.198 < 2e-16 ***
## JobLevel 3704.57 74.99 49.400 < 2e-16 ***
## TotalWorkingYears 56.68 10.90 5.202 2.59e-07 ***
## Travel_Rarely 303.75 113.10 2.686 0.00741 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1348 on 692 degrees of freedom
## Multiple R-squared: 0.9149, Adjusted R-squared: 0.9145
## F-statistic: 2478 on 3 and 692 DF, p-value: < 2.2e-16
# Calculate RMSE
RSS <- c(crossprod(linearMod$residuals))
MSE <- RSS / length(linearMod$residuals)
RMSE <- sqrt(MSE)
The model attains a RMSE < $3000 for the training and the validation set.
# load in the monthly income data
monthly_income_competition = read.csv("../data_sets/CaseStudy2CompSet No Salary.csv", header=TRUE)
# order by id
attach(attrition_competition)
## The following objects are masked from attrition_competition (pos = 4):
##
## Age, BusinessTravel, DailyRate, Department, DistanceFromHome,
## Education, EducationField, EmployeeCount, EmployeeNumber,
## EnvironmentSatisfaction, Gender, HourlyRate, ID, JobInvolvement,
## JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome,
## MonthlyRate, NumCompaniesWorked, Over18, OverTime,
## PercentSalaryHike, PerformanceRating, RelationshipSatisfaction,
## StandardHours, StockOptionLevel, TotalWorkingYears,
## TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany,
## YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
monthly_income_competition_ordered <- monthly_income_competition[order(ID),]
# transform the data
# 1. dummy code all the data
dummy_code2 <- c("BusinessTravel","Department","EducationField","Gender","JobRole","MaritalStatus","OverTime")
for (val in dummy_code2) {
temp <- as.data.frame(dummy.code(monthly_income_competition_ordered[[val]]))
monthly_income_competition_ordered <- cbind(monthly_income_competition_ordered, temp)
}
# predict the values
pred <- predict(linearMod, monthly_income_competition_ordered)
monthly_income_competition_ordered$MonthlyIncome = pred
output1<- as.data.frame(cbind(ID=monthly_income_competition_ordered$ID, MonthlyIncome=monthly_income_competition_ordered$MonthlyIncome))
# export the attrition predicted data
write.csv(output1, "../data_sets/output_data_sets/Case2Predictions_JamieVo Salary.csv")
In term of job specific roles, the following trends were nnocied:
#compare the plots with attrition being the explainatory factor (Categorical)
data$JobF <- as.factor(data$JobRole)
categorical_variables = colnames(categorical)
variables <- c(
"Attrition","BusinessTravel","EducationField"
,"Gender","MaritalStatus","OverTime"
,"JobLevel","JobSatisfaction","JobInvolvement","PerformanceRating"
,"RelationshipSatisfaction","StockOptionLevel","NumCompaniesWorked","TrainingTimesLastYear"
,"WorkLifeBalance","Education","YearsInCurrentRole","EnvironmentSatisfaction" )
graphing <- function(data, variables) {
for (val in variables) {
plot <- ggplot(data=data)+aes_string(val)+geom_bar(aes(fill=as.factor(JobF)), position="fill")+ ggtitle(glue("{val}"))
print(plot)
}
}
graphing(data, variables)
# numerical data
continuous <- subset(data, select = c(ID, Age, JobRole, DailyRate, DistanceFromHome, EmployeeNumber, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, YearsWithCurrManager, StandardHours) )
# differentiate colors by attrition
my_cols <- c("#00AFBB", "#E7B800", "#F0F8FF", "#8FBC8F", "#483D8B", "#00008B", "#006400", "#B8860B", "#A52A2A")
# Correlation panel
panel.cor <- function(x, y){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
# Customize upper panel
upper.panel<-function(x, y){
points(x,y, pch = 1, col = my_cols[continuous$JobRole])
}
# Customize upper panel
upper.panel<-function(x, y){
points(x,y, pch = 1, col = my_cols[continuous$JobRole])
}
pairs(continuous[,1:5], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. business
pairs(continuous[,5:10], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. business
pairs(continuous[,10:15], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. busine
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
ggplot(data, aes(x=TotalWorkingYears, y=MonthlyIncome, color=JobRole)) + geom_point()
There does appear to be a trend that Reserach directors and Managers tend to work more total years, and hence, have higher monthly incomes. There is a stronger cluster of sales reps and technicians on the lower end of total working years and monthly income. This may, and hopefully be due to the fact that they’ve moved onto management and or higher roles.
In conclusion, it was determined that the three factors that provide high specificity and sensitivity to attrition are monhtly salary, overtime, and job role. Those who worked over time tended to have higher attrition rates, along with those in the lower brackets of monthly oncome. As for job roles, it can be seen in the graphs above that sales reps have high attrition rate. The KNN model designed looked a the nearest 7 neighbors to determine the attrition classification, resulting in ~60% sensitivity and ~87% specificity.
The most important predictors for monthly income is whether the employee rarely travels, their job level and their total years working. This is intuitive, but interesting to see that there essentially wasn’t a major emphasis on role. The multiple linear regression model showed that roughly 91% of the montly income could be explained by the variables included, with an RMSE of ~$1400.00.
For job specific roles, there were not major trends noticed in the initial analysis. An interesting point is that Research Directors and Managers tend to have worked a greater number of total years and monthly income. The cluster of sales and technicians towards lower work years and monthly income would hopefully be explained by their movement up into management.